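    import requests

    # Query parameters: the API key, the search term, and the sort order.
    payload = {
        'api_key': 'API_KEY',  # placeholder; substitute a real ScraperAPI key
        'query': 'iphone 15 charger',
        's': 'price-asc-rank'
    }
    # Call the structured Amazon search endpoint and decode the JSON response.
    response = requests.get('https://api.scraperapi.com/structured/amazon/search',
                            params=payload).json()

Data Prep / EDA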
This page presents the data sources, the data processing, and the exploratory data analysis (EDA) visualizations.
Data Collection
Amazon product information was scraped using the API service ScraperAPI. Because Amazon is a hugely popular website, it has many anti-scraping measures in place, such as rate limiting, IP blocking, and dynamic loading; using the external API service avoided these limitations. The search queries were chosen from lists of the top 100 Amazon searches, found on this site and this site. An example request to the API's core endpoint is shown in the snippet at the top of this page.
The Jupyter notebook code for the web scraping can be found here.
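To cover many search queries, the single-request example can be wrapped in a loop. The following is a minimal sketch and not the notebook's actual code: the helper names `fetch_search` and `collect_results` and the added `search_query` field are hypothetical, and the sketch assumes the JSON body lists its products under a `results` key.

```python
import requests

SEARCH_URL = "https://api.scraperapi.com/structured/amazon/search"


def fetch_search(query, api_key, timeout=60):
    """Run one search request and return the decoded JSON body."""
    response = requests.get(
        SEARCH_URL,
        params={"api_key": api_key, "query": query},
        timeout=timeout,
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return response.json()


def collect_results(queries, fetch):
    """Call `fetch` once per query and flatten all products into one list of dicts."""
    rows = []
    for query in queries:
        body = fetch(query)
        # Assumption: products sit under "results"; a missing key yields no rows.
        for item in body.get("results", []):
            # Record which query produced the row (hypothetical extra field).
            rows.append({**item, "search_query": query})
    return rows
```

A call such as `collect_results(queries, lambda q: fetch_search(q, "API_KEY"))` returns a list of dicts that can be passed to `pandas.DataFrame`. Passing the fetch function in as an argument also lets the loop be exercised offline with a stub.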
Additional data was used to supplement the scraped data. Since the scraped data contained only about 26K rows, a Kaggle dataset was added that contains more than one million rows, has roughly the same fields as the scraped data, and is also from the USA (many Amazon datasets on Kaggle are from outside the US).
The raw data from both sources can be seen below in Table 1; the scraped raw data CSV can also be viewed here.
| | type | position | asin | name | image | has_prime | is_best_seller | is_amazon_choice | is_limited_deal | stars | total_reviews | url | availability_quantity | spec | price_string | price_symbol | price | original_price | section_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | search_product | 17 | B06ZY43PDR | Amazon.com Gift Card in a Birthday Pop-Up Box | https://m.media-amazon.com/images/I/71VoEvoetO... | True | False | False | False | 4.9 | 53293.0 | https://www.amazon.com/Amazon-com-Gift-Card-Bi... | NaN | {} | $50.00$2,000.00 | $ | 50.00 | NaN | NaN |
| 1 | search_product | 48 | B0CRKFR1KX | Thanks For Being My Sister Card - Funny Annive... | https://m.media-amazon.com/images/I/71dYDteQ+2... | True | False | False | False | NaN | NaN | https://www.amazon.com/VLPGifts-Thanks-Being-S... | 2.0 | {} | $4.95 | $ | 4.95 | NaN | NaN |
| 2 | search_product | 3 | B08FP1C33H | American Greetings Rainbow Party Supplies, Mul... | https://m.media-amazon.com/images/I/71IHwT+D63... | True | False | False | False | 4.7 | 979.0 | https://www.amazon.com/American-Greetings-Mult... | NaN | {} | $7.26 | $ | 7.26 | {'price_string': '$8.49', 'price_symbol': '$',... | NaN |
| 3 | search_product | 41 | B07P43CTD4 | HandFan Portable Neck Fan, USB Rechargeable Pe... | https://m.media-amazon.com/images/I/71Y8KkiDjc... | True | False | False | False | 4.7 | 847.0 | https://www.amazon.com/HandFan-Personal-Neckla... | NaN | {} | $16.99 | $ | 16.99 | NaN | NaN |
| 4 | search_product | 3 | B0CKNJTTWY | DR770x-2ch LTE 4G Cloud Dash cam Front and Rea... | https://m.media-amazon.com/images/I/51RpBecono... | False | False | False | False | NaN | NaN | https://www.amazon.com/DR770x-2ch-Cloud-Front-... | NaN | {} | $1,056.79 | $ | 1056.79 | NaN | NaN |
| | asin | title | imgUrl | productURL | stars | reviews | price | listPrice | category_id | isBestSeller | boughtInLastMonth |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B08QC9N9G7 | Girl Wireless Gaming Headset, Cute Cat Ear Hea... | https://m.media-amazon.com/images/I/61q2tV9QNF... | https://www.amazon.com/dp/B08QC9N9G7 | 4.3 | 0 | 18.99 | 0.0 | 263 | False | 0 |
| 1 | B00DUIFDJM | Rit Dyes tan Liquid 8 oz. Bottle [Pack of 4 ] | https://m.media-amazon.com/images/I/41gRl9aGTy... | https://www.amazon.com/dp/B00DUIFDJM | 5.0 | 1 | 26.04 | 0.0 | 2 | False | 0 |
| 2 | B014QD012S | Acrylic Felt Fabric RED / 72" Wide/Sold by The... | https://m.media-amazon.com/images/I/5112oTr2k2... | https://www.amazon.com/dp/B014QD012S | 4.6 | 0 | 12.89 | 0.0 | 7 | False | 0 |
| 3 | B0977J6NX7 | Boys Short Sleeve Logo Tee Shirt (5, Heritage ... | https://m.media-amazon.com/images/I/61wlDiTgxZ... | https://www.amazon.com/dp/B0977J6NX7 | 4.1 | 8 | 19.99 | 23.0 | 84 | False | 0 |
| 4 | B09TX89V9G | Navy Blue Birthday Party Decorations Blue Conf... | https://m.media-amazon.com/images/I/81FF-W6boY... | https://www.amazon.com/dp/B09TX89V9G | 4.6 | 0 | 25.99 | 0.0 | 13 | False | 100 |
Data Cleaning
The datasets were cleaned separately and then concatenated, after which a few final cleaning steps were applied to the combined data.
The steps to clean the web-scraped data were:
- Add `date_scraped` column
- Remove unnecessary columns: `type`, `position`, `has_prime`, `is_amazon_choice`, `is_limited_deal`, `availability_quantity`, `spec`, `price_string`, `price_symbol`, `section_name`
- Expand and fix `original_price`
- Rename columns to match standard snake case for merging both datasets
- Drop rows with no asin, name, or price
- Fill NaN values in the `reviews` column with 0
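The steps above can be sketched in pandas as follows. This is an illustration under stated assumptions, not the project's cleaning code: the function names and the rename mapping are hypothetical (the target names are inferred from the Table 2 headers), `original_price` is read only through its visible `price_string` key, and "fix" is interpreted as falling back to `price` when no original price is present.

```python
import ast

import pandas as pd

DROP_COLUMNS = [
    "type", "position", "has_prime", "is_amazon_choice", "is_limited_deal",
    "availability_quantity", "spec", "price_string", "price_symbol", "section_name",
]
# Hypothetical mapping, inferred from the final column names in Table 2.
RENAME = {"image": "image_url", "total_reviews": "reviews", "original_price": "list_price"}


def parse_original_price(value):
    """Return the numeric price from an `original_price` dict, or NaN if absent."""
    if isinstance(value, str):
        # A CSV round trip stores the dict as text, so try to parse it back.
        try:
            value = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return float("nan")
    if isinstance(value, dict) and "price_string" in value:
        # "$1,056.79" -> 1056.79
        return float(value["price_string"].replace("$", "").replace(",", ""))
    return float("nan")


def clean_scraped(df, date_scraped):
    df = df.assign(date_scraped=date_scraped)
    df = df.drop(columns=DROP_COLUMNS, errors="ignore")
    # Expand the nested value; assumption: a missing original price equals the price.
    list_price = df["original_price"].map(parse_original_price).fillna(df["price"])
    df = df.assign(original_price=list_price)
    df = df.rename(columns=RENAME)
    df = df.dropna(subset=["asin", "name", "price"])
    df = df.fillna({"reviews": 0})
    return df.reset_index(drop=True)
```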
The steps to clean the Kaggle data were:
- Add `date_scraped` column
- Remove unnecessary columns: `boughtInLastMonth`
- Drop rows with any NaNs
- Replace `list_price` values of 0 with the value of `price`
- Change `category_id` to the actual category name by using the `category` table
- Rename columns to match standard snake case for merging both datasets
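A comparable sketch for the Kaggle steps is shown below. It is written under assumptions rather than taken from the project: the function name `clean_kaggle` and the rename mapping are hypothetical, and the category lookup table is assumed to have `id` and `category_name` columns. The list-price fix is applied to the raw `listPrice` column before renaming.

```python
import pandas as pd

# Hypothetical mapping from the raw Kaggle names to snake case.
RENAME = {
    "title": "name",
    "imgUrl": "image_url",
    "productURL": "url",
    "listPrice": "list_price",
    "isBestSeller": "is_best_seller",
}


def clean_kaggle(products, categories, date_scraped):
    df = products.assign(date_scraped=date_scraped)
    df = df.drop(columns=["boughtInLastMonth"])
    df = df.dropna()
    # Keep non-zero list prices; where the list price is 0, use the current price.
    df = df.assign(listPrice=df["listPrice"].where(df["listPrice"] != 0, df["price"]))
    # Assumption: the lookup table has columns "id" and "category_name".
    id_to_name = categories.set_index("id")["category_name"]
    df = df.assign(category=df["category_id"].map(id_to_name))
    df = df.drop(columns=["category_id"])
    return df.rename(columns=RENAME).reset_index(drop=True)
```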
After the two datasets were concatenated, the final cleaning steps were:
- Remove duplicates (by asin + date scraped)
- Rename columns
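The final steps can be sketched as below. This is a minimal illustration with assumptions labeled: the function name `combine` is hypothetical, keeping the first occurrence of each duplicate is an assumption (the text does not say which copy is kept), and the title-case rename is inferred from the column headers shown in Table 2.

```python
import pandas as pd


def combine(scraped, kaggle):
    """Concatenate both cleaned frames, drop duplicates, and rename the columns."""
    df = pd.concat([scraped, kaggle], ignore_index=True)
    # A product counts as a duplicate when both asin and scrape date match.
    # Assumption: the first occurrence is the one kept.
    df = df.drop_duplicates(subset=["asin", "date_scraped"], keep="first")
    # Turn snake case into title case, e.g. "image_url" -> "Image Url".
    df = df.rename(columns=lambda name: name.replace("_", " ").title())
    return df.reset_index(drop=True)
```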
The final cleaned (and concatenated) dataset can be seen in Table 2 (with the original raw data in Table 1):
| | Asin | Name | Image Url | Is Best Seller | Stars | Reviews | Url | Price | Date Scraped | List Price | Category |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B09M8G89XT | 10Ft Micro-USB Charger Cords Cables for Samsun... | https://m.media-amazon.com/images/I/51s3ZhSfnW... | False | 4.6 | 0.0 | https://www.amazon.com/dp/B09M8G89XT | 9.99 | 2023-11-01 | 9.99 | Televisions & Video Products |
| 1 | B00112DX8M | CoverGirl Eye Enhancers 1 Kit Shadow - Snow Bl... | https://m.media-amazon.com/images/I/61xpnTBEkF... | False | 4.4 | 0.0 | https://www.amazon.com/dp/B00112DX8M | 5.92 | 2023-11-01 | 6.99 | Makeup |
| 2 | B0CBKHDNFC | 30Pcs Antique Box Corner Protectors, Decorativ... | https://m.media-amazon.com/images/I/81LuRJsM1v... | False | 0.0 | 0.0 | https://www.amazon.com/dp/B0CBKHDNFC | 11.49 | 2023-11-01 | 11.49 | Baby Safety Products |
| 3 | B004UDMDWG | SEGA INITIAL D STREET STAGE PSP the Best for P... | https://m.media-amazon.com/images/I/61l6h7YjxM... | False | 4.3 | 0.0 | https://www.amazon.com/dp/B004UDMDWG | 0.00 | 2023-11-01 | 0.00 | Sony PSP Games, Consoles & Accessories |
| 4 | B000CMJ16A | Grote 47053 - Chrome Plated Rectangular Cleara... | https://m.media-amazon.com/images/I/81tYYrhQjN... | False | 3.9 | 0.0 | https://www.amazon.com/dp/B000CMJ16A | 9.15 | 2023-11-01 | 10.42 | Heavy Duty & Commercial Vehicle Equipment |
The code for the data cleaning can be found here.
Data Preprocessing / Visualization
Several types of EDA were performed to examine the data. Note that most visuals are interactive (zoomable, pannable, etc.). The code for all visualizations can be found here.